BMC Bioinformatics
Springer Science and Business Media LLC
Preprints posted in the last 90 days, ranked by how well they match the content profile of BMC Bioinformatics, based on 383 papers previously published here. The average preprint has a 0.37% match score for this journal, so anything above that is an above-average fit.
Vilain, M.; Aris-Brosou, S.
Background: The ever-growing amount of available biological data means that modern analyses are performed on large datasets. Unfortunately, bioinformatics tools for preprocessing and analyzing data are not always designed to handle such volumes efficiently. Notably, this is the case when encoding DNA and RNA sequences into numerical representations, also called descriptors, before passing them to machine learning models. Furthermore, the Python tools currently available for this preprocessing step are not well suited to integration into pipelines, resulting in slow encoding speeds. Results: We introduce dna-parser, a Python library written in Rust for encoding DNA and RNA sequences into numerical features. The combination of Rust and Python makes it possible to encode sequences rapidly and in parallel across multiple threads while maintaining compatibility with packages from the Python ecosystem. Moreover, the library implements many of the most widely used numerical feature schemes from bioinformatics and natural language processing. Conclusion: dna-parser is an easy-to-install Python library that ships wheels for Linux (musllinux and manylinux), macOS, and Windows via pip (https://pypi.org/project/dna-parser/). The open-source code is available on GitHub (https://github.com/Mvila035/dna_parser) along with the documentation (https://mvila035.github.io/dna_parser/documentation/).
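As a concrete illustration of the encoding step this abstract describes (this is a generic sketch, not the dna-parser API), one of the simplest descriptors is a one-hot encoding of each base into a numerical vector:

```python
# Illustrative sketch: one-hot encoding of a DNA string into a flat
# 0/1 feature vector (4 indicator values per base), the kind of
# descriptor passed to machine learning models.

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat list of 0/1 features."""
    vec = []
    for base in seq.upper():
        row = [0] * 4
        if base in BASES:          # ambiguous bases (e.g. N) stay all-zero
            row[BASES.index(base)] = 1
        vec.extend(row)
    return vec
```

Libraries like the one described here implement many such schemes (k-mer counts, word embeddings, etc.) with the hot loop in a compiled language; the Python-level interface stays the same shape as this toy version.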
Wang, Z.; Sudlow, L. C.; Du, J.; Berezin, M. Y.
Background: Gene Ontology (GO) enrichment analysis is a widely used approach for interpreting high-throughput transcriptomic and genomic data. However, conventional GO over-representation analyses typically yield long, redundant lists of enriched terms that are difficult to relate to biological questions and to the most relevant biological pathways. Results: We present thematicGO, a customizable framework that organizes enriched GO terms into biological themes using a curated keyword-based matching strategy. In this approach, GO enrichment of differentially expressed genes is performed through the g:Profiler Application Programming Interface (API), followed by aggregation, within each theme, of the scores of its contributing GO terms. Side-by-side comparison with conventional GO annotation workflows demonstrates that thematicGO captures related biological outcomes while substantially reducing redundancy and improving readability. To enhance accessibility, we implemented an interactive, web-deployed graphical user interface (GUI) that enables users to upload gene lists and explore thematic enrichment results. Conclusion: thematicGO simplifies functional enrichment analysis by bridging the gap between granular GO term outputs and higher-level biological interpretation, which can be especially useful for RNA-seq studies that identify differentially expressed genes. The approach complements standard GO enrichment with transparent, theme-based aggregation and comparison against classical GO annotation workflows. thematicGO provides an easy-to-use, understandable, and reproducible tool for transcriptomic studies, particularly those involving RNA-seq data and complex biological responses.
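The keyword-based theme aggregation described above can be sketched as follows (the theme names, keywords, and summing rule here are invented for illustration, not taken from thematicGO):

```python
# Hypothetical sketch of keyword-based theme matching: enriched GO
# terms are grouped into curated themes by keyword, and each theme
# aggregates the scores of its contributing terms.

THEMES = {
    "immune response": ["immune", "cytokine", "inflammatory"],
    "cell cycle": ["mitotic", "cell cycle", "division"],
}

def aggregate_themes(enriched_terms):
    """enriched_terms: list of (GO term name, enrichment score) pairs."""
    scores = {theme: 0.0 for theme in THEMES}
    for name, score in enriched_terms:
        for theme, keywords in THEMES.items():
            if any(kw in name.lower() for kw in keywords):
                scores[theme] += score   # sum contributions within a theme
    return scores
```

A dozen redundant immune-related GO terms thus collapse into a single "immune response" theme score, which is the readability gain the abstract describes.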
Grether, V.; Goldstein, Z. R.; Shelton, J. M.; Chu, T. R.; Hooper, W. F.; Geiger, H.; Corvelo, A.; Martini, R.; Davis, M. B.; Robine, N.; Liao, W.
Background: Formalin fixation and paraffin embedding (FFPE) is a widely used, cost-effective method for long-term storage of clinical samples. However, fixation is known to damage nucleic acids, which can present as artifactual bases in sequencing that are absent from higher-fidelity storage methods such as fresh freezing (FF). Various machine learning methods exist for filtering these variant artifacts, but benchmarking their performance is difficult without reliable truth sets. In this study, we employ a collection of 90 paired fresh-frozen and formalin-fixed paraffin-embedded samples from the same tumors to robustly define real and FFPE-derived artifactual variation and to enable objective evaluation of filtering methods. To address existing shortcomings, we propose a novel explainable boosting machine (EBM) model that improves performance, can be easily updated with new data, requires modest computational resources, and is analysis-pipeline agnostic, making it broadly accessible. Results: We evaluated several methods for limiting FFPE-derived variant artifacts using cohorts of B-cell lymphoma samples. We found that capturing the local context around variants is a highly informative, under-utilized feature set not commonly incorporated into existing machine learning methods. Consequently, we developed a novel algorithm, FIFA, for filtering FFPE artifacts, which uses an EBM model, an interpretable decision-tree-based learning algorithm, to address some of these shortcomings. We used four independent cohorts composed of paired lymphoma and cervical cancer samples and a breast cancer cell line with both FF and FFPE samples to define clearly annotated training and test sets, and demonstrated improved performance over existing methods. Additionally, FIFA filtering increased relevant biological signals in FFPE breast cancer datasets distinct from the training and testing sets.
The EBM framework employed by FIFA is computationally efficient, and its generalized additive modeling of features makes it straightforward to incorporate new datasets into existing models over time. Conclusions: Our FFPE variant artifact filtering tool, FIFA, is a marked improvement over existing methods. It can be easily implemented post hoc to supplement existing somatic calling pipelines, training and inference run quickly across most compute environments, and it can be updated online as new training data become available. Accordingly, FIFA represents an important advance for retrospective cancer genomics research, further enhancing access to the vast stores of FFPE-archived tumor samples currently in existence.
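The "local context around variants" feature the authors highlight can be sketched in its simplest form: extract the reference bases flanking a variant position as a categorical feature (the function and example sequence below are illustrative, not FIFA's implementation):

```python
# Illustrative sketch of a local-context feature: the trinucleotide
# centred on a variant site. FFPE deamination artifacts (C>T) tend to
# be enriched in particular sequence contexts, which is why this is an
# informative feature for artifact filtering.

def context(reference, pos, flank=1):
    """Return the reference bases within `flank` of position `pos`."""
    return reference[max(0, pos - flank): pos + flank + 1]
```

A model then treats each observed context (e.g. "ACG" for a C>T call at a given site) as one more input feature alongside depth and quality metrics.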
Cavallaro, G.; Micale, G.; Privitera, G. F.; Pulvirenti, A.; Forte, S.; Alaimo, S.
Motivation: High-throughput sequencing generates large gene lists, making data interpretation challenging. Accurate gene annotation and reliable conversion between identifiers (e.g., gene symbols, Ensembl GeneIDs, Entrez GeneIDs) are essential for integrating datasets, conducting functional analyses, and enabling cross-species comparisons. Existing tools and databases facilitate annotation but often suffer from inconsistencies, missing mappings, and fragmented workflows, limiting reproducibility and interpretability. Results: To address these limitations, we developed geneslator, an R package that unifies gene identifier conversion, ortholog mapping, and pathway annotation across eight model organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana). geneslator provides an up-to-date, precise, and coherent framework that preserves data integrity, enables cross-species analyses, and facilitates robust interpretation of gene function and regulation, outperforming state-of-the-art gene annotation tools. Availability: geneslator is available at https://github.com/knowmics-lab/geneslator. Contact: grete.privitera@unict.it
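The core identifier-conversion problem such packages solve can be sketched in a few lines (the mapping table below is a tiny hand-made example, not the package's data; the Entrez IDs for TP53, BRCA1, and EGFR are their real values):

```python
# Minimal sketch of gene identifier conversion: map gene symbols to
# Entrez GeneIDs, preserving unmapped symbols for inspection rather
# than silently dropping them -- the "missing mappings" problem the
# abstract mentions.

SYMBOL_TO_ENTREZ = {"TP53": "7157", "BRCA1": "672", "EGFR": "1956"}

def convert(symbols, table=SYMBOL_TO_ENTREZ):
    mapped, missing = {}, []
    for sym in symbols:
        if sym in table:
            mapped[sym] = table[sym]
        else:
            missing.append(sym)
    return mapped, missing
```

Returning the unmapped list explicitly is what lets a pipeline report coverage instead of losing genes between steps.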
Tartaglia, J.; Giorgioni, M.; Cattivelli, L.; Faccioli, P.
Background: Advances in high-throughput DNA sequencing technologies have dramatically reduced the time and cost required to generate genomic data. As sequencing is no longer a limiting factor, increasing attention must be paid to optimizing the analysis of the large-scale datasets produced. Efficient processing of such data is essential to reduce computational time and operational costs. In this context, workflow management systems (WMSs) have become key instruments for orchestrating complex bioinformatic pipelines; among them, Nextflow has emerged as one of the most widely adopted solutions in bioinformatics. Methods: To improve scalability and computational efficiency, we used Nextflow to re-design an existing pipeline for the analysis of MNase-defined cistrome-Occupancy (MOA-seq) data. The re-engineering focused on modularizing the workflow and integrating containerization technologies to ensure reproducibility and easier deployment across heterogeneous computing environments. Results: The resulting workflow, MOAflow, is a modernized and fully containerized pipeline for MOA-seq data analysis. With only Docker and Nextflow required, the pipeline guarantees high portability and reproducibility. Data from the original article were used to benchmark the new pipeline; its outputs closely match those of the original study, with minor variations. Conclusions: MOAflow demonstrates how adopting a robust WMS can substantially enhance the performance and usability of pre-existing bioinformatic pipelines. By leveraging containerization and Nextflow, it ensures consistent results across platforms while minimizing setup complexity, highlighting the value of modern WMS-driven approaches in meeting the computational demands of large-scale genomics.
Neubrand, N.; Rachel, T.; Litwin, T.; Timmer, J.; Kreutz, C.; Hess, M.
Motivation: Systems biology strives to unravel the complex dynamics of cellular processes, often with the help of ordinary differential equations (ODEs). However, the sparsity of measured data and the strong non-linearity of common ODEs introduce severe numerical problems in typical modeling tasks. This has given rise to many computational algorithms that must be systematically evaluated to ensure optimal method choices. Currently, the number of well-curated models available for such benchmarking is insufficient, as building and calibrating biologically reasonable models from experiments requires years of work. Results: We present a large-scale collection of 1100 synthetic modeling problems, generated from the ODE systems and experimental designs of 22 published modeling problems. This is achieved by extending a recent method for simulating time-course data for randomly generated observation functions to also include realistic measurement patterns across multiple experimental conditions. By analyzing data and model characteristics, optimization performance, and parameter identifiability, we show that the synthetic problems provide a realistic and diverse extension of the existing problem space. The synthetic collection therefore provides a valuable resource for benchmarking in dynamic modeling. Availability and Implementation: Benchmark problems and algorithms are publicly available at https://github.com/niklasneubrand/1100SyntheticBenchmarksODE and https://zenodo.org/records/14008247.
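The general recipe of generating a synthetic modeling problem can be sketched with a toy system (the decay ODE, step sizes, noise level, and measurement pattern below are illustrative, not the paper's generator): simulate an ODE with known parameters, then sample noisy observations at a sparse set of time points.

```python
# Sketch: simulate dx/dt = -k*x with forward Euler, then generate
# sparse, noisy observations -- the two ingredients of a synthetic
# benchmark problem with known ground-truth parameters.
import random

def simulate(k=0.5, x0=1.0, dt=0.01, steps=500):
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + dt * (-k * xs[-1]))   # forward Euler step
    return xs

def observe(xs, indices, sigma=0.05, seed=0):
    """Sample noisy observations at a sparse measurement pattern."""
    rng = random.Random(seed)
    return [xs[i] + rng.gauss(0.0, sigma) for i in indices]

trajectory = simulate()
data = observe(trajectory, indices=[0, 100, 250, 500])
```

Because the true k is known, any calibration algorithm run on `data` can be scored objectively, which is exactly what makes synthetic collections useful for benchmarking.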
Queme, B.; Muruganujan, A.; Ebert, D.; Mushayahama, T.; Gauderman, W. J.; Mi, H.
Background: Accurate single-nucleotide polymorphism (SNP) annotation is central to genomic research, yet widely used tools and gene models often yield divergent results. Prior studies have shown such discrepancies in small datasets, but the extent of genome-wide variation and its impact on downstream pathway analysis remain unclear. Results: We conducted a comprehensive comparison of three commonly used SNP annotation tools (ANNOVAR, SnpEff, and VEP), using both Ensembl and RefSeq gene models, to evaluate more than 40 million SNPs from the Haplotype Reference Consortium. At the protein level, annotation output differed significantly across tools and gene models (p-adj < 0.001), with discrepancies present in both genic and intergenic regions. RefSeq produced broader annotation coverage, particularly for intergenic SNPs, while Ensembl showed greater internal consistency. SnpEff provided the most complete coverage overall, whereas no single tool or model configuration achieved full annotation recovery of the union reference. Integration across tools and models maximized coverage and reduced annotation loss. In a case study of 204 colorectal cancer-associated SNPs from the FIGI GWAS, pathway enrichment results varied depending on annotation strategy. The fully integrated approach identified all four significant pathways, whereas several single-tool or single-model strategies missed one or more. Conclusion: SNP annotation outcomes are influenced by both the tool and the gene model used, and relying on a single approach may result in incomplete coverage. A multi-tool, multi-model strategy provides the most comprehensive annotation and preserves enriched pathways, supporting more robust and reproducible genomic interpretation.
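The integration strategy reduces to set operations over per-tool annotation results. The tool names below are the real ones from the study, but the toy SNP sets are invented for illustration:

```python
# Sketch of multi-tool integration: the union recovers SNPs missed by
# any single tool/gene-model combination, while the intersection gives
# a high-confidence consensus.

annotations = {
    "ANNOVAR": {"rs1", "rs2", "rs4"},
    "SnpEff":  {"rs1", "rs2", "rs3", "rs4"},
    "VEP":     {"rs2", "rs3", "rs5"},
}

union = set().union(*annotations.values())          # maximal coverage
consensus = set.intersection(*annotations.values()) # agreed by all tools

# coverage gained over the single best tool:
best_single = max(annotations.values(), key=len)
gain = union - best_single
```

Even in this toy example the best single tool misses an annotation (`rs5`) that the union recovers, mirroring the paper's finding that no single configuration achieved full recovery of the union reference.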
Gong, C.; Yang, Q.; Wan, R.; Li, S.; Zhang, Y.; Li, Y.
Background: Joint variant calling is a crucial step in population-scale sequencing analysis. While population-scale sequencing is a powerful tool for genetic studies, achieving fast and accurate joint variant calling on large cohorts remains computationally challenging. Findings: To meet this challenge, we developed the Distributed Population Genetics Tool (DPGT), an efficient computing framework and robust tool for joint variant calling on large cohorts, built on Apache Spark. DPGT reduces joint calling on large cohorts to a single command on a local computer or computing cluster, eliminating the need for users to create complex parallel workflows. We evaluated DPGT against existing methods using 2,504 1000 Genomes Project (1KGP), 6 Genome in a Bottle (GIAB), and 9,158 internal whole-genome sequencing (WGS) samples. DPGT produced results comparable in accuracy to existing methods, in less time and with better scalability. Conclusions: DPGT is a fast, scalable, and accurate tool for joint variant calling. The source code, implemented in Java and C++, is available under a GPLv3 license at https://github.com/BGI-flexlab/DPGT.
Andrews, B.; Ranganathan, R.
Motivation: DNA barcodes are commonly used to distinguish genuine mutations from sequencing errors in sequencing-based assays. In the presence of indel errors, using barcodes requires accurate alignment of the raw reads to distinguish genuine indels from indel errors. Existing strategies generally rely on aligners built for homology comparison and do not fully utilize quality scores. We reasoned that an aligner purpose-built for error correction could yield higher-quality barcode-sequence maps. Results: Here we present BCAR, a fast barcode-sequence mapper for correcting sequencing errors. BCAR considers all of the evidence for each base call at each position, both during alignment and during final consensus generation. BCAR creates high-accuracy barcode-sequence maps from simulated reads across a broad range of error rates and read lengths, outperforming existing methods. We apply BCAR to two experimental datasets, where it generates high-quality barcode-sequence maps. Availability and implementation: BCAR source code, documentation, and test data are available from https://github.com/dry-brews/BCAR
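The quality-aware consensus idea can be sketched as follows (this is a generic illustration in the spirit of the abstract, not BCAR's algorithm): for reads sharing a barcode, each base call votes with a weight derived from its Phred quality, so one confident read can outvote several low-quality ones.

```python
# Sketch of quality-weighted consensus generation: Phred quality Q
# corresponds to an error probability p = 10**(-Q/10); each base call
# votes with weight 1 - p, and the consensus takes the heaviest base
# at each position.

def consensus(reads, quals):
    """reads: equal-length strings; quals: matching lists of Phred scores."""
    out = []
    for i in range(len(reads[0])):
        weight = {}
        for read, qual in zip(reads, quals):
            p_err = 10 ** (-qual[i] / 10)
            weight[read[i]] = weight.get(read[i], 0.0) + (1 - p_err)
        out.append(max(weight, key=weight.get))
    return "".join(out)
```

Note the second test below: a single Q40 call outweighs two Q3 calls, which a plain majority vote would get wrong; this is what "fully utilizing quality scores" buys.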
Magalhaes, H.; Weber, J.; Klau, G. W.; Marschall, T.; Prodanov, T.
Variation in sequence copy number (CN) between individuals can be associated with phenotypic differences. Consequently, CN calling is an important step for disease association and identification, as well as for genome assembly validation. Traditionally, CN calling is done by mapping sequencing reads to a linear reference genome and estimating the CN from the observed read depth. This approach, however, is significantly hampered by sequences and rearrangements not present in a linear reference genome; at the same time, simple CN prediction for individual graph nodes does not make use of the graph topology and can lead to inconsistent results. To address these issues, we propose Floco, a method for CN calling with respect to a genome graph using a network flow formulation. Given a graph and alignments against that graph, we calculate raw CN probabilities for every graph node based on the Negative Binomial distribution and the base-pair coverage across the node, and then use integer linear programming to compute the CN flow through the whole graph. We tested this approach on 15 aligned datasets involving three different graphs, as well as HiFi and ONT sequencing reads and linear assemblies split into reads. The results demonstrate that adding the network flow formulation increases the accuracy of CN predictions by up to 43% compared with read-depth-based estimation alone. Additionally, concordance between predictions from the three different sequence sources reached 93.2%. Floco fills a gap in CN calling tools specifically designed for genome graphs.
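The per-node scoring step described above can be sketched independently of the flow formulation (dispersion, depths, and CN range below are illustrative choices, not Floco's parameters; the graph-wide ILP reconciliation is omitted):

```python
# Sketch: score candidate copy numbers for one graph node by a
# Negative Binomial likelihood of its observed coverage, with
# mean = CN * haploid depth.
import math

def nb_logpmf(k, mu, r=10):
    """Negative Binomial log-pmf with mean mu and integer dispersion r."""
    if mu <= 0:
        return 0.0 if k == 0 else float("-inf")
    p = r / (r + mu)
    return (math.log(math.comb(k + r - 1, k))
            + r * math.log(p) + k * math.log(1 - p))

def best_cn(coverage, haploid_depth=15, max_cn=6):
    scores = {cn: nb_logpmf(coverage, cn * haploid_depth)
              for cn in range(max_cn + 1)}
    return max(scores, key=scores.get)
```

Per-node maximum-likelihood calls like `best_cn` are what the paper describes as "raw CN probabilities"; the network flow step then enforces consistency of these calls across adjacent nodes.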
Qorri, E.; Varga, V.; Priskin, K.; Latinovics, D.; Takacs, B.; Pekker, E.; Jaksa, G.; Csanyi, B.; Torday, L.; Bassam, A.; Kahan, Z.; Pinter, L.; Haracska, L.
Background: Circular RNAs (circRNAs) have emerged as promising non-invasive cancer biomarkers due to their stability, abundance in body fluids, and regulatory potential. However, circRNA differential expression analysis (DEA) remains challenging, largely owing to a lack of consensus on important preprocessing strategies such as filtering and normalization. While well-established bulk RNA-sequencing frameworks are commonly applied to circRNA data, newer approaches such as CIRI-DE (part of the CIRI3 suite) integrate both linear and circular transcript information to improve detection. Despite these developments, an assessment of such integrative strategies is lacking, and the critical impact of filtering on DEA model performance has not been comprehensively evaluated. Results: In this study, we evaluated the impact of multiple normalization and filtering strategies on circRNA DEA using five experimental datasets, including two in-house blood platelet sets, plus semi-parametric simulated in silico datasets. Our results emphasize the importance of selecting an appropriate filtering threshold, as overly lenient filtering substantially reduced model performance across datasets. We found edgeR's filterByExpr() strategy particularly effective at handling zero counts in circRNA data, while also generating the most reliable results across most datasets. Furthermore, by incorporating linear and circular information as described in CIRI-DE, most methods identified a higher number of differentially expressed (DE) circRNAs compared to circular counts alone. Notably, circRNAs identified by both CIRI-DE and the modified bulk RNA-sequencing pipelines showed substantial overlap. Conclusion: Our findings demonstrate that automated filtering combined with linear-aware normalization significantly enhances the sensitivity and reproducibility of circRNA DEA, providing a standardized framework for more reliable biomarker discovery in transcriptomic research.
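The kind of count-based filtering at issue can be sketched with a simple rule, loosely in the spirit of edgeR's filterByExpr idea but not its algorithm (the thresholds and toy counts below are invented):

```python
# Sketch of expression filtering for sparse count data: keep a circRNA
# only if it reaches a minimum count in at least as many samples as
# the smallest experimental group, so that zero-heavy rows are dropped
# before differential expression testing.

def filter_counts(counts, group_sizes, min_count=10):
    """counts: {circRNA id: [per-sample counts]}; returns kept ids."""
    min_group = min(group_sizes)
    kept = []
    for circ_id, row in counts.items():
        if sum(c >= min_count for c in row) >= min_group:
            kept.append(circ_id)
    return kept
```

Loosening `min_count` toward zero admits rows dominated by zeros, which is exactly the "overly lenient filtering" the study found to degrade DEA performance.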
Gorin, G.; Guruge, D.; Goodman, L.
Rigorous experimental design, including formal power analysis, is a cornerstone of reproducible RNA sequencing (RNA-seq) research. Designing an RNA-seq experiment requires computing the minimum number of samples needed to identify an effect of a particular size at a predefined significance level. Ideally, the statistical test used for analyzing the experimental data should match the test used for sample size determination; however, few tools adopt the assumptions of the popular differential expression testing framework DESeq2, and most opt for simulation-based rather than analytical approaches. Grounded in the DESeq2 model framework, we derive sample size requirements for both single-cell and bulk RNA-seq experiments, delivered as DEPower, a web-based power-analysis tool (https://poweranalysis-fb.streamlit.app/) that makes rigorous RNA-seq study design accessible to all researchers.
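To make the notion of analytical sample-size determination concrete, here is a back-of-the-envelope sketch using the classic two-sample normal approximation; this is a generic z-test formula, not the DESeq2-based derivation the tool implements:

```python
# Sketch of analytical power analysis: smallest n per group so that a
# two-sided z-test detects standardized effect size d at significance
# alpha with power 1 - beta, via n = 2 * (z_{1-a/2} + z_{power})^2 / d^2.
import math

def z_quantile(p):
    """Standard normal quantile via bisection on the erf-based CDF."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def n_per_group(d, alpha=0.05, power=0.8):
    za = z_quantile(1 - alpha / 2)
    zb = z_quantile(power)
    return math.ceil(2 * (za + zb) ** 2 / d ** 2)
```

Count-based frameworks replace the normal model with a negative binomial one, but the structure (effect size, alpha, power in; minimum n out) is the same.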
Bercovich Szulmajster, U.; Wiuf, C.; Albrechtsen, A.
Linkage disequilibrium (LD) is a central statistic in population genetic studies, commonly measured by the squared correlation between pairs of genetic variants. An important drawback of this measure is its upward bias under finite sample sizes. Different methods exist that correct for this sample-size bias; however, because the correlation is a ratio, no unbiased estimator exists. In this work, we present a procedure to calibrate these methods using a non-parametric approach with simulated data. We use forward modeling to generate genotype matrices with known parameters, followed by an inverse mapping to recover estimates of the underlying parameters; a mean-centering calibration is then applied to the recovered estimate of the true parameter. Applied to real and simulated data, this approach shows consistent improvement in accuracy compared to other sample-size-aware methods. Furthermore, to study effects on downstream analyses, we analyze classification performance in LD pruning, where we also observe an improvement, particularly in extreme cases with low sample sizes of 5 or 10 individuals.
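The upward bias and the forward-simulation idea can both be demonstrated in a few lines (the simulation below uses independent Gaussian variables rather than genotypes, purely as an illustration): for independent variants the true r² is 0, yet the sample r² has expectation close to 1/(n-1), and simulating with known parameters is what reveals the bias a mean-centering calibration can then remove.

```python
# Sketch: estimate the finite-sample bias of r^2 by forward simulation
# under a known truth (independence, so true r^2 = 0).
import random

def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy)

def mean_bias(n, reps=1000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]   # independent pairs:
        y = [rng.gauss(0, 1) for _ in range(n)]   # true r^2 is zero
        total += r_squared(x, y)
    return total / reps
```

For n = 10 the mean sample r² comes out near 1/9 ≈ 0.11 despite a true value of 0, which is the kind of systematic offset the paper's calibration targets.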
Schilder, B. M.; Skene, N. G.; Murphy, A. E.
Motivation: Mapping genes across identifier systems and species is a routine but critical step in bioinformatics workflows. Despite its ubiquity, gene mapping is frequently handled with bespoke, ad hoc solutions, duplicating effort and introducing opportunities for error. These issues are exacerbated by the prevalence of non-one-to-one homolog relationships and inconsistent handling of gene identifiers across species and databases, which can compromise downstream analyses and reproducibility. Results: We present orthogene, an R/Bioconductor package that simplifies gene mapping within and across hundreds of species. orthogene provides a unified, workflow-oriented framework that integrates automated species and identifier standardization, homolog inference across multiple databases, flexible handling of ambiguous homolog relationships, and transformation of gene lists, tables, and high-dimensional matrices into analysis-ready formats. By abstracting common sources of technical complexity while retaining user control, orthogene enables transparent, reproducible, and scalable gene mapping across a wide range of biological contexts. Availability: https://bioconductor.org/packages/orthogene Contact: brian_schilder@alumni.brown.edu
Raffaelli, G. T.; Kislinger, J.; Kroupa, T.; Hlinka, J.
Background and objective: Quantifying higher-order statistical dependencies in multivariate biomedical data is essential for understanding collective dynamics in complex systems such as neuronal populations. The connected information framework provides a principled decomposition of the total information content into contributions from interactions of increasing order. However, its application has been limited by the computational complexity of conventional maximum entropy formulations. In this work, we present a generalised formulation of connected information based on maximum entropy problems constrained by entropic quantities. Methods: The entropic-constraint approach, in contrast to the original constraints based on marginals or moments, transforms the original nonconvex optimisation into a tractable linear program defined over polymatroid cones. This simplification enables efficient, robust estimation even under undersampling. Results: We present the theoretical foundations, algorithmic implementation, and validation through numerical experiments and real-world data. Applications to symbolic sequences, large-scale neuronal recordings, and DNA sequences demonstrate that the proposed method accurately detects higher-order interactions and remains stable even with limited data. Conclusions: The accompanying open-source software library, HORDCOIN (Higher ORDer COnnected INformation), provides user-friendly tools for computing connected information using both marginal- and entropy-based formulations. Overall, this work bridges the gap between abstract information-theoretic measures and practical biomedical data analysis, enabling scalable investigation of higher-order dependencies in neurophysiological and other complex biological systems such as the genome.
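The lowest rung of such a decomposition is easy to compute directly: total correlation (multi-information), the difference between the sum of marginal entropies and the joint entropy, here estimated naively from symbol counts. This is only an illustration of the quantities involved; the full connected-information hierarchy and its linear-program formulation are beyond this sketch.

```python
# Sketch: total correlation TC(X) = sum_i H(X_i) - H(X), estimated
# from empirical counts. TC is zero iff the variables are independent
# and grows with statistical dependence of any order.
import math
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def total_correlation(samples):
    """samples: list of equal-length tuples of symbols."""
    k = len(samples[0])
    marginals = sum(entropy(Counter(s[i] for s in samples)) for i in range(k))
    joint = entropy(Counter(samples))
    return marginals - joint
```

Two perfectly coupled binary variables carry exactly 1 bit of total correlation, while independent ones carry none; connected information further splits such totals into pairwise, triplet, and higher-order contributions.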
Mulaudzi, S.; Kulkarni, S.; Marin, M. G.; Farhat, M. R.
Background: Low-frequency (minority) variants, i.e. variants detectable within a sample at low allele frequencies, are relevant in several areas of research and health, from cancer to pathogen heteroresistance. There is uncertainty about the optimal bioinformatic approach for accurately and reproducibly distinguishing low-frequency variants from sequencing or mapping error. To address this, we benchmarked seven variant callers on precision, recall, and false-positive characteristics for detecting low-frequency variants, using simulated short-read whole-genome sequencing data for 700 Mycobacterium tuberculosis strains. We also developed a new low-frequency error model, based on read mapping and quality metrics, for filtering the output of the best-performing tool. Results: We simulated 378 unique variants across 5 genomic backgrounds spanning 4 lineages. Variants were simulated to represent 3 genomic region categories, 10 allele frequencies, and 5 sequencing depths. FreeBayes, a haplotype-based variant caller, achieved the highest pooled F1 score of the seven tools in drug resistance regions (average F1 = 0.86), and its higher performance held across genomic context and background. Across tools, we identified lower performance in repetitive (low-mappability) regions and strong reference bias in low-frequency variant calling. We validated variant caller performance on a sample of in vitro strain mixtures, substantiating our ranking. When paired with FreeBayes, the error model excludes 49% of false variants and <1% of true variants. Conclusions: Our analysis provides evidence to support best practices for low-frequency variant calling, including tool choice, masking, and filtering. We also provide a new error model that excludes false-positive low-frequency variant calls from FreeBayes output.
Warr, M. J.; Dinh, T.; Root, B.; Onstott, E.; Yu, K.; Mudge, J.; Ramaraj, T.; Kahanda, I.; Mumey, B.
In this work, we investigate using motif subsequence features to predict whether a genomic region is accessible to regulatory proteins, i.e. an accessible chromatin region (ACR), enabling transcription of associated genes. We focus on plants, whose agricultural and ecological importance make them interesting and important organisms to study, and whose complex genomes provide important stress tests for our algorithm. We show that motif sequence similarity as found by co-linear chaining can be used in combination with machine learning models to effectively predict ACRs in genome assemblies.
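The co-linear chaining mentioned above can be sketched in its simplest form (an O(n²) dynamic program over anchor pairs; real chainers add gap costs and faster data structures, and the toy anchors below are invented):

```python
# Sketch of co-linear chaining: given motif hits as (query position,
# target position) anchors, find the longest chain in which BOTH
# coordinates strictly increase -- a 2-D longest-increasing-subsequence.

def chain(anchors):
    """anchors: list of (qpos, tpos); returns the length of the best chain."""
    anchors = sorted(anchors)
    best = [1] * len(anchors)
    for i, (qi, ti) in enumerate(anchors):
        for j in range(i):
            qj, tj = anchors[j]
            if qj < qi and tj < ti:        # co-linear: both increase
                best[i] = max(best[i], best[j] + 1)
    return max(best, default=0)
```

A long chain of motif hits in consistent order is evidence that two regions share motif architecture, which is the similarity signal fed into the machine learning models for ACR prediction.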
Bleker, C.; Zagorscak, M.; Blejec, A.; Gruden, K.; Zupanic, A.
Summary: Boolean and logic-based modeling approaches are well suited to the analysis of complex biological systems, particularly when detailed biochemical and kinetic information is unavailable. In such settings, biological pathways are represented as networks capturing system components and their interactions, providing a simplified yet informative abstraction of system behavior. While the structural topology of these networks is often well characterized, the absence of mechanistic detail limits the applicability of parameter-dependent modeling frameworks. To address this, we present BoolDog, a Python package for the construction, simulation, and analysis of Boolean and semi-quantitative Boolean networks. BoolDog supports synchronous simulation with events, attractor and steady-state identification, network visualization, and the systematic transformation of logic-based models into continuous ordinary differential equation (ODE) systems, enabling the seamless integration of discrete and continuous modeling paradigms. Networks can be imported and exported across standard formats, and BoolDog integrates natively with established Python libraries for network analysis and visualisation, including NetworkX, igraph, and py4Cytoscape. Together, these capabilities provide a flexible, accessible, and interoperable platform for logic-based modeling of complex biological systems. Availability and implementation: BoolDog is implemented in Python and available at https://github.com/NIB-SI/BoolDog/.
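The two core operations named in this summary, synchronous simulation and attractor identification, fit in a short sketch (the toy 3-node network and update rules below are invented for illustration, not a BoolDog model or its API):

```python
# Sketch of a synchronous Boolean network: all nodes update
# simultaneously from the previous state; repeating states mark an
# attractor (fixed point or cycle).

RULES = {
    "A": lambda s: s["C"],                  # A copies C
    "B": lambda s: s["A"] and not s["C"],
    "C": lambda s: not s["B"],
}

def step(state):
    """One synchronous update: every rule reads the OLD state."""
    return {node: rule(state) for node, rule in RULES.items()}

def find_attractor(state, max_steps=50):
    seen = []
    for _ in range(max_steps):
        if state in seen:                   # revisited: attractor found
            return seen[seen.index(state):]
        seen.append(state)
        state = step(state)
    return []
```

From the all-False state this toy network settles into a single fixed point; semi-quantitative extensions replace the 0/1 node values with continuous activity levels governed by analogous logic.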
Kalra, A.; Paulin, L.; Sedlazeck, F.
Background: Accurate discrimination of true structural variants (SVs) from artifacts in long-read sequencing data remains a critical bottleneck. Numerous machine learning solutions have been proposed, ranging from classical models using engineered features to advanced deep learning and foundation-model interpretability methods. However, a systematic comparison of their performance, efficiency, and practical utility is lacking. Results: We conducted a comprehensive benchmark of five machine learning paradigms for SV filtering using standardized Genome in a Bottle (GIAB) data for samples HG002 and HG005. We evaluated classical Random Forest classifiers on 15 genomic features, computer vision models (ResNet/VICReg), diffusion-based anomaly detection, sparse autoencoders (SAEs) on the Evo2-7B foundation model, and multimodal ensembles. A simple Random Forest on interpretable features achieved a peak F1-score of 95.7%, effectively matching all more complex models (ResNet50: 95.9%, diffusion: 95.8%). This study represents the first application of diffusion-based anomaly detection and sparse autoencoders to structural variant analysis; while diffusion models learned highly discriminative, disentangled representations and SAEs uncovered biologically interpretable features (including atoms specific to Alu deletions, chromosome X variants, and insertion events), they did not significantly surpass this classification ceiling. Ensemble methods offered no performance benefit but may have future potential given the orthogonality of vision-based and linear features. Conclusions: Our findings demonstrate that for the established task of germline SV filtering, simpler, interpretable models provide an optimal balance of accuracy, speed, and transparency. This benchmark establishes a pragmatic framework for method selection and argues that increased model complexity must be justified by clear, unmet biological needs rather than marginal predictive gains.
Terra, R.; Carvalho, D.; Machado, D. J.; Osthoff, C.; Ocana, K.
Advances in High-Performance Computing (HPC) have enabled increasingly complex genomic analyses, including phylogenomics. These analyses contribute to understanding the evolution of viruses and pathogens, improving our knowledge of disease transmission and supporting targeted public health strategies. However, given the growing number of tools and processing steps involved, executing these analyses manually, step by step, becomes error-prone and inefficient. To address this challenge, we present HP2NET, a robust framework for reproducible, efficient, and scalable phylogenetic network analysis. HP2NET integrates five workflows based on state-of-the-art tools such as PhyloNetworks and PhyloNet, allowing multiple datasets and workflows to be analyzed in a single execution. The framework includes features such as task packaging and data reuse to improve performance and resource utilization in HPC environments. We performed a comprehensive performance evaluation of the software used within HP2NET, identifying bottlenecks and analyzing the gains from parallel processing. In our experimental environment, data reuse reduced runtime by up to 15.35% for a small dataset, while parallel execution of the five pipelines reduced total runtime by up to 90.96% compared to sequential runs. Finally, we validate HP2NET in a real-world case study analyzing Dengue virus genomes, demonstrating its value for large-scale phylogenetic analyses.